Anomaly detection by correlated metrics

ABSTRACT

A system is configured to detect a small, but meaningful, anomaly within one or more metrics associated with a platform being monitored. The system displays visuals of the metrics so that a user monitoring the platform can effectively notice a problem associated with the anomaly and take appropriate action to remediate the problem. Moreover, the system uses an ensemble of machine learning algorithms, with a multi-agent voting system, to detect the anomaly. Therefore, via the display of the visuals and the implementation of the machine learning algorithms, the techniques described herein provide an improved way of representing a large number of metrics (e.g., hundreds, thousands, etc.) being monitored for a platform. Moreover, the techniques are configured to expose actionable and useful information associated with the platform in a manner that can be effectively interpreted by a user.

PRIORITY APPLICATION

This application claims the benefit of and priority to U.S. Provisional Application No. 62/897,1664, filed Sep. 6, 2019 and entitled “Dynamic Visualization of Metrics and Model Construction,” the entire contents of which are incorporated herein by reference.

BACKGROUND

Machine learning is increasingly being utilized to perform a variety of tasks where patterns and inferences may be analyzed to monitor for problems that can arise in a system, such as a platform that hosts a website. More specifically, anomaly detection algorithms can be used to detect meaningful anomalies (e.g., deviations) to a set of metrics associated with the platform. Many of these anomalies are easily detected and/or noticed because they are represented by large dips or large spikes in the metrics being monitored. For instance, a metric being monitored may experience an unexpected and sharp decrease in data values over a short period of time (e.g., thirty seconds, two minutes, five minutes, etc.). Or, a metric being monitored may experience an unexpected and sharp increase in data values over a short period of time.

However, conventional anomaly detection algorithms are largely ineffective with regard to detecting small, but meaningful, anomalies that may be helpful in identifying a problem within a system before more serious consequences are realized. Consequently, conventional anomaly detection algorithms are unable to detect what may be referred to as a “slow bleed” anomaly, or a scenario where an actual data value for a metric being monitored is slightly different than an expected data value for the metric, but this slight difference may be a meaningful signal of a growing problem a platform is experiencing.

Additionally, conventional systems that implement these anomaly detection algorithms lack the ability to display information being generated in a way that enables a user monitoring hundreds of metrics associated with a platform to effectively interpret the information and take appropriate action to remediate a problem.

It is with respect to these and other technical considerations that the disclosure made herein is presented.

SUMMARY

The techniques described herein are configured to detect a small, but meaningful, anomaly within one or more metrics associated with a platform being monitored. The techniques are configured to display visuals of the metrics so that a user monitoring the platform can effectively notice a problem associated with the anomaly and take appropriate action to remediate the problem. Moreover, the techniques use an ensemble of machine learning algorithms, with a multi-agent voting system, to detect the anomaly. Therefore, via the display of the visuals and the implementation of the machine learning algorithms, the techniques described herein provide an improved way of representing a large number of metrics (e.g., hundreds, thousands, etc.) being monitored for a platform. Moreover, the techniques are configured to expose actionable and useful information associated with the platform in a manner that can be effectively interpreted by a user.

The techniques described herein may be used in a scenario where one or more users are tasked with monitoring a set of metrics related to a platform, in real-time, in order to triage potential problems and/or take appropriate remedial actions. For example, users of a Site Reliability Engineering team may be tasked with monitoring hundreds or thousands of metrics related to a platform that supports a website. The metrics being monitored may relate to a particular function or process of the platform, such as the ability for customers to purchase items via an electronic commerce site. In a more specific example, the metrics being monitored may relate to one or more of a customer being able to sign in, a customer being able to check out, and so forth. The techniques may be used in other scenarios as well.

In some embodiments, a broader category of metric, such as “checkout”, can be divided into multiple sub-category metrics (e.g., “checkouts_completed” and “checkouts_cancelled”). Moreover, a metric may include varying attributes. For instance, the attributes can relate to a location in which a customer is purchasing an item from a website, a type of device being used to purchase the item, a payment method being used to pay for the item, and so forth. Accordingly, the set of metrics being monitored can be expansive and can quickly stretch into the hundreds or thousands depending on the function or process of a platform being monitored, a number of categories and/or sub-categories that compose the metrics, and/or a number of attributes that can vary from one metric to the next.

To further illustrate how attributes can vary, a “checkouts_completed” metric can have individual metrics for check outs by customers in different countries, such as the United States of America, Great Britain, Australia, France, Russia, South Africa, Brazil, etc. The “checkouts_completed” metric can additionally have individual metrics for check outs performed using different types of devices, such as IPHONE devices, ANDROID devices, and/or other types of devices. The “checkouts_completed” metric can further have individual metrics for check outs in which items are paid for using different payment methods, such as PAYPAL, a credit card, and/or other types of payment methods. An individual metric can include one or multiple attributes, and one varying attribute may result in a different metric being monitored. For example, the following are different “checkout” metrics that may be monitored:

-   a “checkouts_completed_overall” metric (e.g., all checkouts implemented using the platform regardless of location, device, and payment method),
-   a “checkouts_completed_USA_IPHONE_PAYPAL” metric,
-   a “checkouts_completed_AUSTRALIA_IPHONE_PAYPAL” metric,
-   a “checkouts_completed_USA_ANDROID_PAYPAL” metric,
-   a “checkouts_completed_USA_IPHONE_creditcard” metric.
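To make the growth in the number of metrics concrete, the following minimal sketch enumerates metric names from combinations of attribute values. The attribute lists and the `metric_names` helper are illustrative assumptions and are not part of the described system.

```python
from itertools import product

# Hypothetical attribute values, taken from the examples in the text.
LOCATIONS = ["USA", "AUSTRALIA", "FRANCE"]
DEVICES = ["IPHONE", "ANDROID"]
PAYMENT_METHODS = ["PAYPAL", "creditcard"]


def metric_names(sub_category, *attribute_lists):
    """Return one metric name per combination of attribute values."""
    names = [f"{sub_category}_overall"]
    for combo in product(*attribute_lists):
        names.append("_".join((sub_category, *combo)))
    return names


names = metric_names("checkouts_completed", LOCATIONS, DEVICES, PAYMENT_METHODS)
print(len(names))  # 1 overall metric + 3 * 2 * 2 combinations = 13
```

Adding a single new location, device, or payment method multiplies the number of combinations, which is why the set of monitored metrics can quickly reach the hundreds or thousands.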

Based on the description above, one can see that the number of metrics being monitored can be large and can quickly grow as the platform supporting a website expands to accommodate different locations, different devices, different payment methods, and/or other distinguishing attributes. The system described herein helps reduce the complexities in monitoring a large number of metrics for anomalies by generating and displaying multiple visuals that represent data (e.g., real-time or near real-time data values) associated with the metrics being monitored. Using the data associated with the metrics being monitored, the system can generate and display a first visual to provide a high-level indication of whether the platform is healthy. A platform may be healthy when the data values associated with the metrics are not deviating from expected or predicted data values in a meaningful manner. The first visual can be updated in real-time as the data associated with the metrics continues to be received by the system.

Technical advantages of user interfaces described herein include the ability to monitor hundreds or thousands of metrics in real-time using a visual, such as a radar-based visual. This leads to scalability and efficiency because so many metrics may be monitored and displayed at once. Because traditional methods of metrics display may be avoided, this also leads to a more memory and CPU efficient method of metrics display. Still further, because the visualization of many metrics is possible in a compact form, the visual may also be technically advantageous for monitoring many metrics on a compact computer device, such as a mobile phone with a limited size display screen. In addition, the user is more efficient because the user can quickly check a visualization of many metrics, such as the radar-based visual, and the user can accurately link to any metric that shows a problem.

In various examples, the first visual may be a radar-based visual on which an object (e.g., a dot) derived from the underlying data can be plotted. The first visual can be generated using an unsupervised artificial neural network algorithm that projects high-dimensional data onto a two-dimensional map. For instance, the first visual may be generated using a custom modification of self-organizing map (SOM). The self-organizing map may be built using a broader category of metrics, such as those related to a customer checking out.

An object plotted on the radar-based visual can represent data associated with a combination of metrics related to an aspect of the platform being monitored (e.g., “checkouts_completed”). The way in which the radar-based visual is configured for display enables the object to signal anomalous activity associated with the combination of metrics. For example, a location of the object is determined based on matching current, real-time data associated with the combination of metrics to a closest observation of historic data of the combination of metrics. Accordingly, regions towards the center of the radar-based visual may represent more common scenarios that are frequently observed for the data. Regions towards the periphery of the radar-based visual represent more uncommon scenarios that are less frequently observed for the data. As a result, anomalous activity may be signaled as the plotted location of the object moves toward the periphery of the radar-based visual over time. In contrast, if the plotted location of the object remains towards the center of the radar-based visual, then the plotted location of the object may signal normal activity associated with the combination of metrics. The movement of the object from the center towards the periphery of the radar-based visual can be viewed by a user over time, and the user can use his or her judgment to determine whether an issue should be further investigated based on the location of the object.

The radar-based visual can include different color regions and/or shades to help signal a problem. For example, the inner part of the radar-based visual around a center point may be different shades of blue (e.g., darker shades of blue toward the center that turn to lighter shades when looking outward toward the periphery). The outer part of the radar-based visual may be different shades of red (e.g., darker shades of red toward the periphery that turn to lighter shades when looking inward toward the center). A user can notice when the object starts to move from a darkly-shaded blue region near the center of the radar-based visual, which can be a strong signal of normal activity for the combination of metrics represented by the object, towards a darkly-shaded red region near the periphery of the radar-based visual, which can be a strong signal of anomalous activity for the combination of metrics represented by the object.

Another signal of an anomaly can include an increase in a size of the object (e.g., the size of a dot). A size of the object represents a degree to which the real-time data for the combination of metrics is anomalous to the observed historic data. It is possible that the object may be located near the center of the radar-based visual, yet one or more of the real-time metric values is foreign and is not commonly seen in the historical data. In a specific example, a large number of metrics being monitored may have normal values that closely match historical observations, but a small number of metrics (e.g., one, two, three, etc.) being monitored may have real-time values that are not commonly observed or that have never been observed before (e.g., foreign values). In this situation, the increased size of the object can be used to signal the anomaly because the location of the object is likely to be plotted towards the center of the radar-based visual due to the normal observations of the larger number of metrics.

In some examples, the movement of a plotted object and a size of the plotted object can work together to signal an anomaly. This may be useful because a plotted object in a red region alone may occur due to an impact of a peak period, a slow period, or some other factor that may signal an abnormality but may not entirely be abnormal. Accordingly, a size of the plotted object can provide a strong indication of the anomalous behavior because the size of the plotted object can indicate that this observation has not occurred before during training.
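The interplay between the plotted location and the size of the object can be sketched as follows. This is a simplified illustration under stated assumptions, not the self-organizing radar algorithm itself: the `HISTORIC` observations, the use of Euclidean distance, and the mapping from observation frequency to radius are all hypothetical.

```python
import math

# Hypothetical historic observations: each is a vector of normalized metric
# values, paired with how often that scenario was seen during training.
HISTORIC = [
    {"vector": (1.00, 1.00), "count": 900},  # very common scenario
    {"vector": (0.80, 0.90), "count": 90},
    {"vector": (0.30, 0.40), "count": 10},   # rare scenario
]


def place_object(current):
    """Return (radius, size) for the dot representing `current`.

    radius: 0.0 at the center (most common match) up to nearly 1.0 at the
            periphery (rarest match).
    size:   distance to the closest historic observation, so a value never
            seen in training yields a larger dot.
    """
    distances = [math.dist(current, h["vector"]) for h in HISTORIC]
    best = min(range(len(HISTORIC)), key=distances.__getitem__)
    max_count = max(h["count"] for h in HISTORIC)
    radius = 1.0 - HISTORIC[best]["count"] / max_count
    return radius, distances[best]


normal = place_object((1.0, 1.0))   # matches a common scenario exactly
foreign = place_object((1.0, 1.6))  # one metric holds an unfamiliar value
```

In this sketch both observations are plotted at the center because their closest historic match is the most common scenario, but only the second one produces a non-zero size, which mirrors the case in which the size of the object, rather than its location, signals the anomaly.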

Using the data associated with metrics being monitored, the system can also generate and display a second visual. The second visual enables a user to localize a problem to a smaller group of metrics being monitored. Stated another way, the second visual allows a user to determine which metrics, out of hundreds or thousands of metrics being monitored, are experiencing anomalous activity.

In various examples, the second visual may be a tree map visual that includes a plurality of sections. Each section in the tree map visual can be associated with a specific attribute used to compose one or more of the metrics being monitored (e.g., the “checkout” metrics). A size and/or a color of an individual section can be used to indicate anomalous activity for the specific attribute. As described above, the specific attribute may be a location of multiple different locations from which customers can purchase an item via an electronic commerce site and supporting platform (e.g., the United States of America, Great Britain, Australia, France, Russia, South Africa, Brazil, etc.), a type of device of multiple different types of devices that can be used by customers to purchase an item via the electronic commerce site and supporting platform (e.g., IPHONE, ANDROID, etc.), a payment method of multiple different payment methods that customers can use to purchase an item via the electronic commerce site and supporting platform (e.g., PAYPAL, credit card, etc.), and so forth.

A user can interact with the tree map visual to localize the problem to a smaller group of metrics. For example, the user may notice a larger, red section in the tree map visual which signals anomalous activity for a specific attribute (e.g., the United States of America, IPHONE, PAYPAL, etc.). The user may then select the section associated with the specific attribute. The system receives the user input that selects the section of the tree map visual and updates both the radar-based visual and the tree map visual so that only metrics related to the specific attribute are represented. In other words, a smaller group of metrics, which are related to the specific attribute, are used to rebuild one or both of the radar-based visual and the tree map visual based on the user input.

Consequently, the system described herein provides a real-time, interactive application where a user may initially notice a problem by viewing an object that is growing in size and/or moving towards, or located in, the periphery of the radar-based visual, and then the user can interact with the tree map, or other detailed visual, to better understand the problem. It is to be appreciated that the radar-based visual may show one or more objects, and each object may represent multiple metrics.

In various examples described herein, the radar-based visual that initially provides a high-level overview of the health of the platform can be generated using a modified version of the “self-organizing map” (SOM) algorithm (i.e., “Kohonen's network”). The SOM algorithm achieves low-dimensional views of high-dimensional data. For example, high-dimensional data can be mapped into a two-dimensional data matrix using the algorithm and then the two-dimensional data matrix can be visualized as a heatmap. The modified version of the SOM algorithm accounts for time-series data and alters the output representation so that a symmetric radar-based visual is displayed. Thus, the modified version of the SOM algorithm may be referred to herein as the “self-organizing radar” (SOR) algorithm.
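For orientation, a minimal sketch of the standard, unmodified SOM training loop is shown below. It does not implement the time-series handling or the symmetric radar layout of the SOR algorithm, whose details are not given in this section. The function names, the linear decay schedule, and the Gaussian neighborhood are assumptions chosen for illustration.

```python
import math
import random


def train_som(samples, rows, cols, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a standard self-organizing map and return its weight grid."""
    rng = random.Random(seed)
    dim = len(samples[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Learning rate and neighborhood radius decay over time.
        frac = 1.0 - epoch / epochs
        lr, sigma = lr0 * frac, max(sigma0 * frac, 1e-3)
        for x in samples:
            bmu = best_matching_unit(grid, x)
            for node, w in grid.items():
                grid_dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2 * sigma ** 2))
                # Pull each node toward the sample, scaled by its grid
                # distance from the best matching unit.
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return grid


def best_matching_unit(grid, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(grid, key=lambda node: math.dist(grid[node], x))


# Two hypothetical three-metric observations projected onto a 2 x 2 map.
samples = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
grid = train_som(samples, rows=2, cols=2)
cell = best_matching_unit(grid, [0.0, 0.0, 0.0])
```

Each high-dimensional observation is reduced to a single grid coordinate (`cell`), which is what allows the trained map to be rendered as a two-dimensional heatmap.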

The way in which the radar-based visual is configured enables an object representing a combination of metrics to generally signal anomalous activity as the plotted location of the object moves or is located toward the periphery of the radar-based visual over time. In contrast, if the plotted location of the object remains towards the center of the radar-based visual, then the plotted location of the object generally signals normal activity associated with the combination of metrics. This provides an effective and an efficient way for a user to notice a problem early on before the problem fully develops and causes more serious consequences. Compared to conventional techniques, the user is not required to examine or interpret large amounts of complicated information, such as mathematical equations, numerical values/coefficients, value comparisons to thresholds, multiple charts, etc., in order to determine that a problem exists.

In various examples described herein, the tree map visual can be generated using a multi-agent voting system of various machine learning agents. In one embodiment, an agent is a regression model coupled with a Quantile Loss Gradient Boosted Trees (QLGBT) machine learning model for vote-decision making. The system analyzes the metrics being monitored and determines which metrics are highly correlated so that they can be distributed among the agents. For instance, the system can use independent regressors to determine that a first metric and a second metric satisfy a correlation threshold, and therefore are highly correlated and can be assigned to one or more agents. If a first metric and a second metric are highly correlated, a data value for the first metric can be confidently used by an agent to predict a data value for the second metric using the agent's regression method of choice (e.g., polynomial regression). In a more specific example, a historic analysis of data values can be used to identify a correlation in which a data value for a “checkouts_completed” metric can be used to predict a data value for a “checkouts_cancelled” metric. In another specific example, a historic analysis of data values can be used to identify a correlation in which a data value for a “checkouts_completed_France” metric can be used to predict a data value for a “checkouts_completed_overall” metric.
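The pairing of correlated metrics can be illustrated with a short sketch. The Pearson coefficient, the straight-line fit (a degree-one stand-in for the agent's regression method of choice), the `CORRELATION_THRESHOLD` value, and the sample data are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


CORRELATION_THRESHOLD = 0.9  # hypothetical value

# Hypothetical historic data values for two metrics.
completed = [100, 120, 140, 160, 180]
cancelled = [10, 12, 14, 16, 18]

correlation = pearson(completed, cancelled)
if abs(correlation) >= CORRELATION_THRESHOLD:
    # The pair is correlated enough to be assigned to an agent, which can
    # then predict one metric from the other.
    slope, intercept = fit_line(completed, cancelled)
    predicted_cancelled = slope * 200 + intercept
```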

Given two highly correlated metrics, an agent generates a prediction model so that a data value of a first metric can predict a data value for a second metric. A prediction model can be generated for each set of highly correlated metrics by multiple different agents. Provided an actual data value for a particular metric, an agent is configured to compare the actual data value to the data values predicted by the agent's prediction model generated in association with the particular metric. The agent then uses its own prediction error and/or other exogenous factors, such as temporal factors, holiday factors, etc., to generate upper and lower quantile limits, or bounds, on the error using QLGBT. If the error in a predicted data value falls outside a confidence interval (e.g., the upper and lower bounds) when compared to the actual data value, then the agent provides a vote that signals an anomaly. In other words, a vote by an agent means that an error in predicting a data value for a metric was outside an acceptable error range. In various examples, the error is a normalized error (e.g., a percentage) that is agnostic to the scale of the data. In other examples, the error may be an absolute error.
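An agent's vote decision can be sketched as below. A trained QLGBT model is not reproduced here; instead, the sketch shows the quantile (pinball) loss that quantile-regression models of this kind minimize, and uses empirical quantiles of hypothetical historic errors as a stand-in for the model's upper and lower bounds. The sign convention of the error and all names are assumptions.

```python
def pinball_loss(actual, predicted, quantile):
    """Quantile (pinball) loss for one observation."""
    diff = actual - predicted
    return max(quantile * diff, (quantile - 1) * diff)


def empirical_quantile(values, quantile):
    """Nearest-rank quantile; a stand-in for a trained quantile model."""
    ordered = sorted(values)
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


def normalized_error(actual, predicted):
    """Prediction error as a fraction of the actual value (scale-agnostic)."""
    return (predicted - actual) / actual


def agent_votes(actual, predicted, lower, upper):
    """Vote for an anomaly when the error leaves the [lower, upper] interval."""
    error = normalized_error(actual, predicted)
    return error < lower or error > upper


# Hypothetical historic normalized errors for one agent.
historic_errors = [-0.03, -0.02, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03]
lower = empirical_quantile(historic_errors, 0.05)
upper = empirical_quantile(historic_errors, 0.95)

within_bounds = agent_votes(actual=100, predicted=101, lower=lower, upper=upper)
slow_bleed = agent_votes(actual=95, predicted=100, lower=lower, upper=upper)
```

With these hypothetical bounds of plus or minus three percent, an actual value roughly five percent below the prediction is enough to produce a vote, even though such a difference would not register as a large dip.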

The vote is associated with a particular metric based on whether the error falls outside the upper bound or the lower bound. For example, a first correlated metric may be associated with the upper bound and a second correlated metric may be associated with the lower bound. The agent may use this approach when an anomaly is associated with a decrease in values (e.g., a dip). If the error in predicting the data value is outside the upper bound, the first correlated metric receives the vote and is therefore the voted metric. If the error in predicting the data value is outside the lower bound, the second correlated metric receives the vote and is therefore the voted metric. In another example, the first correlated metric may be associated with the lower bound and the second correlated metric may be associated with the upper bound. The agent may use this approach when an anomaly is associated with an increase in values (e.g., a spike).
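The assignment of a vote to one of the two correlated metrics can be expressed as a small function. The function name, the `anomaly` parameter, and its string values are hypothetical.

```python
def voted_metric(error, lower, upper, first_metric, second_metric, anomaly="dip"):
    """Return the metric that receives the vote, or None when no vote is cast."""
    if lower <= error <= upper:
        return None
    if anomaly == "dip":
        # First metric is tied to the upper bound, second to the lower bound.
        upper_metric, lower_metric = first_metric, second_metric
    else:
        # For a spike, the association is reversed.
        upper_metric, lower_metric = second_metric, first_metric
    return upper_metric if error > upper else lower_metric


vote = voted_metric(0.05, -0.03, 0.03,
                    "checkouts_completed", "checkouts_cancelled")
```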

Once the votes are received from all the agents associated with a large correlated set of metrics being monitored (e.g., hundreds, thousands, etc.), the system can analyze the agents determined to be associated with the voted metric, and localize a problem to a specific attribute. The system can examine all the agents to determine a total number of agents that share an attribute (e.g., a location such as the United States of America) with the voted metric, without regard to whether the agent voted or not. Moreover, the system can examine the voting agents to determine a number of the voting agents that share the same attribute (e.g., the location such as the United States of America) with the voted metric.

If a percentage determined based on the number of the voting agents that share the attribute and the total number of agents that share the same attribute is greater than a predetermined threshold percentage (e.g., a percentage between seventy percent and ninety percent), then the system can determine that the problem is likely localized to that attribute. In a specific example, a threshold can be set to seventy-five percent, and thus, if eighty out of a total of one hundred agents that use a metric related to the United States of America attribute provide a vote signaling an anomaly, then the system can localize a problem to a specific location, i.e., the United States of America.
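The threshold comparison can be sketched as follows, using the numbers from the example above. The representation of an agent as a pair of attributes and a vote flag is a simplifying assumption, as is the second, non-anomalous attribute added for contrast.

```python
def localized_attributes(agents, threshold=0.75):
    """Return attributes whose share of voting agents exceeds `threshold`.

    `agents` is a list of (attributes, voted) pairs, where `attributes` is
    the set of attributes of the metrics an agent uses and `voted` is True
    when the agent signaled an anomaly.
    """
    totals, votes = {}, {}
    for attributes, voted in agents:
        for attr in attributes:
            totals[attr] = totals.get(attr, 0) + 1
            if voted:
                votes[attr] = votes.get(attr, 0) + 1
    return {attr: votes.get(attr, 0) / total
            for attr, total in totals.items()
            if votes.get(attr, 0) / total > threshold}


# Eighty of one hundred "USA" agents vote; five of fifty "FRANCE" agents vote.
agents = ([({"USA"}, True)] * 80 + [({"USA"}, False)] * 20
          + [({"FRANCE"}, True)] * 5 + [({"FRANCE"}, False)] * 45)
localized = localized_attributes(agents)
print(localized)  # {'USA': 0.8}
```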

The system can use this information in the tree map visual to show that there may be a problem with respect to a particular attribute, such as the United States of America. That is, the system can increase the size and/or change the color of the section in the tree map visual that corresponds to the particular attribute. In one embodiment, the size and/or the color of a section is related to a percentage of agents that vote for an error within a specific localization (e.g., a specific attribute).
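One possible mapping from the percentage of voting agents to a section's size and color is sketched below. The linear scaling, the doubling of area at one hundred percent, and the blue-to-red color ramp are assumptions; the text only states that size and/or color relate to the percentage.

```python
def section_style(vote_fraction, base_area=1.0):
    """Map the fraction of voting agents (0..1) to an area and an RGB color."""
    fraction = min(max(vote_fraction, 0.0), 1.0)
    area = base_area * (1.0 + fraction)  # sections grow with more votes
    red = round(255 * fraction)
    blue = round(255 * (1.0 - fraction))
    return area, (red, 0, blue)


healthy = section_style(0.0)
anomalous = section_style(1.0)
```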

Using a machine learning model, such as but not limited to the Quantile Loss Gradient Boosted Trees (QLGBT) machine learning model, to model an individual agent's decision within a multi-agent voting system, the system is able to provide higher sensitivity to a “slow bleed” anomaly, or a scenario where an actual data value for a metric is slightly different than an expected data value for the metric, but the slight difference may be a meaningful signal of a growing problem a platform is experiencing. Use of the multi-agent voting system also provides a higher resistance to false positives, because while a false positive in a single agent is likely, the chances of simultaneous occurrences of false positives in a larger number of agents is lower. Moreover, the diversity of the agents with respect to their assigned metrics, exogenous variables, and models used overcomes the bias resulting from the vulnerability of using a single model or limited set of metrics. Accordingly, the machine learning-based multi-agent voting system comprises a problem inference tool rather than a mere outlier detection tool.

While some of the examples described above relate to health monitoring of a website such as an electronic commerce site, the techniques can additionally or alternatively be applied to other examples in which other metrics and/or other platforms are being monitored. For instance, the techniques described herein can be used to: display visuals representing active listings and market status for a seller of items, to display visuals representing the segmentation of search results, or to display visuals representing a post-processing layer on top of a variety of possible artificial intelligence (AI) outputs. Equally, the examples above may be applied to other performance metrics, such as monitoring machine metrics in a data center, such as CPU utilization, memory utilization, disk activity, or other metrics.

It should be appreciated that the subject matter described above and in further detail below can be implemented as a computer-controlled apparatus, a computer-implemented method, a computing device, or as an article of manufacture such as a computer-readable storage medium. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended that this Summary be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The Detailed Description is described with reference to the accompanying FIGS. In the FIGS., the left-most digit(s) of a reference number identifies the FIG. in which the reference number first appears. The same reference numbers in different FIGS. indicate similar or identical items.

FIG. 1 is a diagram illustrating an example environment in which a monitoring system is configured to display visuals so that a user can view a health indication for a platform and localize a problem to one or more metrics and/or attributes being monitored.

FIG. 2A is a diagram that illustrates a radar-based visual and a tree map visual, and how the radar-based visual and the tree map visual can be used to localize a problem with a platform to a specific metric and/or a specific attribute being monitored.

FIG. 2B is a diagram that illustrates a radar-based visual and a tree map visual, and how the radar-based visual and the tree map visual can be updated based on a problem that is localized to a specific attribute.

FIG. 3 illustrates an example of how the Quantile Loss Gradient Boosted Trees (QLGBT) model-based multi-agent system can be used to produce information to be represented via the tree map visual.

FIG. 4A is a diagram that illustrates an example of how the multi-agent system can be used to identify an attribute that may be associated with anomalous activity.

FIG. 4B is a diagram that illustrates an example of how a number of votes for an attribute can be compared to a threshold to determine that the attribute is likely associated with anomalous activity.

FIG. 5 is a flow diagram showing aspects of an illustrative method, according to one embodiment disclosed herein.

FIGS. 6A-6B are a flow diagram showing aspects of an illustrative method, according to one embodiment disclosed herein.

FIG. 7 is a computer architecture diagram illustrating aspects of an example computer architecture for a computer capable of executing the software components described herein.

DETAILED DESCRIPTION

The following Detailed Description presents technologies for generating visuals which are useable to indicate a problem associated with the health of a platform and to enable a user to identify a specific metric and/or a specific attribute with which the problem is associated. The disclosed technologies can enhance the functionality and efficiency of various machine learning systems and related applications.

FIG. 1 is a diagram illustrating an example environment 100 in which a monitoring system 102 is configured to display visuals so that a user 104 can view a health indication for a platform 106 and localize a problem to one or more metrics and/or attributes being monitored. In one configuration, the monitoring system 102 can be part of the platform 106. Alternatively, the monitoring system 102 can be a separate system that can be called upon by the platform 106 to implement the monitoring described herein.

In one example, the platform 106 can include resources (e.g., processing resources, networking resources, storage resources, etc.) that support a website, such as an electronic commerce site. Moreover, the user 104 may be part of a Site Reliability Engineering team tasked with monitoring hundreds or thousands of metrics 108, in real-time, in order to triage potential problems associated with the platform 106 and/or take appropriate remedial actions. The metrics 108 being monitored may relate to a particular function or process supported by the platform 106, such as the ability for customers to purchase items via an electronic commerce site.

While some of the examples described herein relate to health monitoring of a website such as an electronic commerce site, the techniques can additionally or alternatively be applied to other examples in which other metrics and/or other platforms are being monitored. For instance, the techniques described herein can be used to: display visuals representing active listings and market status for a seller of items (e.g., the user 104 in FIG. 1 can be a person that lists items on an electronic commerce site), to display visuals representing the segmentation of search results, or to display visuals representing a post-processing layer on top of a variety of possible artificial intelligence (AI) outputs.

FIG. 1 illustrates various devices 110(1)-110(N) (may be referred to herein as devices 110) that may interact with the platform 106 to implement the particular function or process supported by the platform 106 (where N in the context of FIG. 1 is a positive integer number that can be hundreds, thousands, hundreds of thousands, etc.). The devices 110 are each configured to generate and send data associated with the monitored metrics 112(1)-112(N) to the platform 106 and/or the monitoring system 102. Accordingly, FIG. 1 illustrates that the devices 110 generate and send the data associated with the monitored metrics 112(1)-112(N) (may be referred to herein as data 112) over one or more network(s) 114, and the data 112 is received by the monitoring system 102.

In various examples, the metrics 108 being monitored may belong to a category of metrics, such as checking out from an electronic commerce site (e.g., “checkout”). Accordingly, the category of metrics can be divided into multiple sub-category metrics (e.g., “checkouts_completed” and “checkouts_cancelled”). Moreover, a metric may include varying attributes. To this end, FIG. 1 illustrates that the data 112 provided by the devices 110 can include attributes 116(1)-116(N) (may be referred to herein as attributes 116). The attributes 116 may relate to a location in which a customer is purchasing an item from a website, a type of device being used to purchase the item, a payment method being used to pay for the item, and so forth.

More specifically, the metrics 108 being monitored may be based on customers checking out and buying items in different countries, such as the United States of America, Great Britain, Australia, France, Russia, South Africa, Brazil, etc. The metrics 108 being monitored may be based on customers checking out and buying items using different types of devices, such as IPHONE devices, ANDROID devices, and/or other types of devices. The metrics 108 being monitored may be based on customers checking out, buying items, and paying for the items using different payment methods, such as PAYPAL, a credit card, and/or other types of payment methods. An individual metric 108 can include one or multiple attributes 116, and one varying attribute 116 may result in a different metric 108 being monitored. For example, the following are different “checkout” metrics that may be monitored:

-   a “checkouts_completed_overall” metric (e.g., all checkouts implemented using the platform regardless of location, device, and payment method),
-   a “checkouts_completed_USA_IPHONE_PAYPAL” metric,
-   a “checkouts_completed_AUSTRALIA_IPHONE_PAYPAL” metric,
-   a “checkouts_completed_USA_ANDROID_PAYPAL” metric, and
-   a “checkouts_completed_USA_IPHONE_creditcard” metric.
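
By way of illustration only, the following sketch shows how varying a single attribute yields a different metric to monitor. The helper name `metric_name` and the underscore-joined naming scheme are assumptions inferred from the example metric names above; the disclosure does not require any particular naming format.

```python
from itertools import product


def metric_name(sub_category, *attributes):
    """Join a sub-category and its attribute values into one metric name.

    The underscore-separated format is an assumption based on the example
    names in the text, not a stated requirement.
    """
    return "_".join([sub_category, *attributes])


# Illustrative attribute values taken from the examples in the text.
locations = ["USA", "AUSTRALIA"]
devices = ["IPHONE", "ANDROID"]
payments = ["PAYPAL", "creditcard"]

# One metric per attribute combination: changing any single attribute
# produces a different metric to monitor.
metrics = [
    metric_name("checkouts_completed", loc, dev, pay)
    for loc, dev, pay in product(locations, devices, payments)
]
```

With two values for each of three attributes, this produces eight distinct “checkouts_completed” metrics, which suggests how quickly the number of monitored metrics can grow.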

The monitoring system 102 can comprise device(s) (e.g., servers) and/or other components that communicate with one another, with the platform 106, and/or with the devices 110 via one or more network(s) 114. Moreover, the monitoring system 102 can include a self-organizing radar (SOR) module 118 and a voting module 120. The number of illustrated modules is just an example, and the number can vary higher or lower. That is, functionality described herein in association with the illustrated modules can be performed by a fewer number of modules or a larger number of modules on one device or spread across multiple devices.

Using the data 112 associated with the metrics 108 being monitored, the self-organizing radar (SOR) module 118 can be used to generate and display a first, radar-based visual 122 to provide a high-level indication of whether the platform 106 is healthy. A platform 106 may be healthy when the data values associated with the metrics 108 are not deviating from expected or predicted data values in a meaningful manner. The first, radar-based visual 122 can be updated in real-time as the data 112 associated with the metrics 108 continues to be received by the monitoring system 102 over time.

Using the data 112 associated with the metrics 108 being monitored, the voting module 120 can optionally be used to generate and display a second, detailed display, such as a tree map visual 124. The second, tree map visual 124 enables the user 104 to localize a problem to a smaller group of metrics 108 being monitored. Stated another way, the second, tree map visual 124 allows a user to determine which metrics, out of hundreds or thousands of metrics being monitored, are experiencing anomalous activity. As further described herein, the second, tree map visual 124 can be generated using a machine learning model built based on historical data values for the metrics 126. It is to be appreciated that the second visual 124 may be any other visual capable of displaying metrics, such as a detailed grid view, or other graphic.

By generating and displaying the two visuals 122 and 124, the monitoring system 102 provides a real-time, interactive application where the user 104 can initially notice a problem using the first, radar-based visual 122, and then the user 104 can shift his or her focus to the second, tree map visual 124 to obtain a better understanding of the source of the problem. This is further described herein with respect to FIGS. 2A-2B.

Network(s) 114 can include, for example, public networks such as the Internet, private networks such as an institutional and/or personal intranet, or some combination of private and public networks. Network(s) 114 can also include any type of wired and/or wireless network, including but not limited to local area networks (LANs), wide area networks (WANs), satellite networks, cable networks, Wi-Fi networks, WiMax networks, mobile communications networks (e.g., 3G, 4G, and so forth) or any combination thereof. Network(s) 114 can utilize communications protocols, including packet-based and/or datagram-based protocols such as internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), or other types of protocols. Moreover, network(s) 114 can also include a number of devices that facilitate network communications and/or form a hardware basis for the networks, such as switches, routers, gateways, access points, firewalls, base stations, repeaters, backbone devices, and the like.

In various examples, device(s) of the monitoring system 102 can include one or more computing devices that operate in a cluster or other grouped configuration to share resources, balance load, increase performance, provide fail-over support or redundancy, or for other purposes. For instance, device(s) of the monitoring system 102 can belong to a variety of classes of devices such as traditional server-type devices.

A device 110 can belong to a variety of classes of devices, such as server-type devices, desktop computer-type devices, mobile-type devices, special purpose-type devices, embedded-type devices, and/or wearable-type devices. Thus, a device 110 can include, but is not limited to, a desktop computer, a game console and/or a gaming device, a tablet computer, a personal data assistant (PDA), a mobile phone/tablet hybrid, a laptop computer, a telecommunication device, a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, an automotive computer, a network-enabled television, a terminal, an Internet of Things (IoT) device, a work station, a media player, a personal video recorder (PVR), a set-top box, or any other sort of computing device.

FIG. 2A is a diagram that illustrates a radar-based visual 202 and a tree map visual 204, and how the radar-based visual 202 and the tree map visual 204 can be used to localize a problem to a specific metric being monitored.

The radar-based visual 202 is configured to plot objects based on the data 112 received in association with the metrics 108 being monitored. In one embodiment, the radar-based visual 202 can be generated using an unsupervised artificial neural network algorithm that projects high-dimensional data onto a two-dimensional map. In one configuration, the radar-based visual 202 can be generated using a modified version of the “self-organizing map” (SOM) algorithm (i.e., “Kohonen's network”). The SOM algorithm achieves low-dimensional views of high-dimensional data. For example, high-dimensional data can be mapped into a two-dimensional data matrix using the algorithm and then the two-dimensional data matrix can be visualized as a heatmap, table, or other visualization. The modified version of the SOM algorithm accounts for time-series data and alters the output representation so that a symmetric radar-based visual is displayed, and thus, may be referred to as a “self-organizing radar” (SOR) algorithm 206.
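
By way of illustration only, the following is a minimal sketch of the classic (unmodified) self-organizing map algorithm, which projects high-dimensional vectors onto a two-dimensional grid. It does not implement the time-series handling or the symmetric radar output of the SOR algorithm 206. The function names, grid size, learning-rate schedule, and neighborhood schedule are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the grid node whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small classic SOM; returns a rows x cols grid of weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood shrinks, with a floor
        for x in rng.sample(data, len(data)):    # visit samples in random order
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Nodes near the winning node on the grid move more toward x.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return weights


# Hypothetical three-dimensional metric snapshots, scaled to [0, 1].
samples = [[0.1, 0.2, 0.1], [0.15, 0.1, 0.2], [0.9, 0.8, 0.95], [0.85, 0.9, 0.8]]
som = train_som(samples)
position = best_matching_unit(som, [0.1, 0.2, 0.1])  # 2-D grid cell for one snapshot
```

In this sketch, each snapshot of metric values maps to a grid cell, and the grid of weight vectors is the two-dimensional data matrix that could then be rendered as a heatmap or other visualization.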

In some implementations, the SOR algorithm 206 is adjusted to accommodate time-based expectations for the monitored metrics. Stated another way, separate SOR algorithms 206 can be used for different predetermined time periods. For instance, the SOR algorithm 206 can include adjustments for each hour of each day of a particular week in a year. This accounts for daily and/or seasonal expectations. In further implementations, the SOR algorithm 206 can be updated over time to reflect trends in the metrics being monitored.

The radar-based visual 202 can include different color and/or shaded regions to help signal a problem. For example, the inner part of the radar-based visual 202 around a center point can include regions 208, 210, 212 represented by different shades of a color, such as blue. Region 208 may be a darker shade of blue, which indicates a strong signal of normal activity, while region 212 may be a lighter shade of blue indicating a weaker signal of normal activity. Region 210 may be a shade of blue between the darker shade and the lighter shade.

The outer part of the radar-based visual 202 can include regions 214, 216, 218 represented by different shades of another color, such as red. Region 214 may be a darker shade of red, which indicates a strong signal of anomalous activity, while region 218 may be a lighter shade of red indicating a weaker signal of anomalous activity. Region 216 may be a shade of red between the darker shade and the lighter shade.

An object 220 (e.g., a dot) plotted on the radar-based visual 202 can represent a combination of metrics related to an aspect of the platform 106 being monitored. For instance, the combination of metrics may relate to a “checkouts_completed” category or sub-category, and the individual metrics in the combination may have varying attributes (e.g., different locations, different devices, different payment methods). The way in which the radar-based visual 202 is configured for display enables the object 220 to signal anomalous activity associated with the combination of metrics as the plotted location of the object moves toward the periphery of the radar-based visual 202 over time. For example, if the object 220 moves from the shown location in region 208, to the location in region 216, then this movement signals growing anomalous activity within the combination of metrics. In contrast, if the plotted location of the object 220 remains towards the center of the radar-based visual 202 (within region 208), then the plotted location of the object 220 signals normal activity associated with the combination of metrics. It is to be appreciated that the radar-based visual 202 may display more than one object concurrently.

In some embodiments, a size of the object (e.g., the size of a dot) can increase as the object moves through regions near the center of the radar-based visual 202 towards regions near the periphery of the radar-based visual 202. This is illustrated by the size of the object 220 when it is located in region 208 compared to the size of the object after it has moved to region 216. The size dimension of an object may help determine whether anomalous activity is due to a potentially normal but less frequent event, such as a peak demand period, promotion, or holiday, or may be due to activity that has not been observed before. In other words, the size of an object may indicate how elevated or different the object is from a specific region on the radar-based visual. An elevated object may mean that the radar has rarely or never seen such behavior when it was being trained.

Therefore, a size of the object 220 can represent a degree to which the real-time data for the combination of metrics is anomalous to the observed historic data. It is possible that the object may be located near the center of the radar-based visual, yet one or more of the real-time metric values is foreign and is not commonly seen in the historical data. In a specific example, a large number of metrics being monitored may have normal values that closely match historical observations, but a small number of metrics (e.g., one, two, three, etc.) being monitored may have real-time values that are not commonly observed or that have never been observed before (e.g., foreign values). In this situation, the increased size of the object can be used to signal the anomaly because the location of the object is likely to be plotted towards the center of the radar-based visual due to the normal observations of the larger number of metrics.

A user can notice when the object starts to move from the blue regions 208, 210, 212 near the center of the radar-based visual 202 towards the red regions 214, 216, 218 near the periphery of the radar-based visual 202. Consequently, the movement, or lack thereof, provides an indication of the health of the platform 106. It is understood, in the context of this disclosure, that the radar-based visual 202 can be segmented to display multiple different objects representing different combinations of metrics simultaneously. Alternatively, the radar-based visual 202 can display objects generated at different time frames on segments associated with these time-frames. Furthermore, the radar-based visual 202 can be segmented to show and compare or contrast different domains on the same radar, such as the business-aspect domain on one half of the visual compared to the infrastructure domain on the second half, or the health of one data-center in contrast with the health of another data-center.

The tree map visual 204 includes a plurality of sections, and each section in the tree map visual 204 can be associated with a specific attribute 224(1)-224(M) used to compose one or more of the metrics being monitored (where M in the context of FIG. 2A is a positive integer number that can be hundreds, thousands, etc.). A size and/or a color of an individual section can be used to indicate anomalous activity for the specific attribute 224(1)-224(M). That is, as an amount of anomalous activity grows, a size of a section will increase and the color may change (e.g., from green to yellow to red).

As described above, a specific attribute 224(1)-224(M) may be a location of multiple different locations from which customers can purchase an item via an electronic commerce site and supporting platform (e.g., the United States of America, Great Britain, Australia, France, Russia, South Africa, Brazil, etc.), a type of device of multiple different types of devices that can be used by customers to purchase an item via the electronic commerce site and supporting platform (e.g., IPHONE, ANDROID, etc.), a payment method of multiple different payment methods that customers can use to purchase an item via the electronic commerce site and supporting platform (e.g., PAYPAL, credit card, etc.), and so forth.

The tree map visual 204 can be generated using machine learning-based agents 226 (e.g., hundreds of agents, thousands of agents, etc.) that constitute a multi-agent voting system 228. In one embodiment, a machine learning-based agent 226 can include a polynomial regression model coupled with a Quantile Loss Gradient Boosted Trees (QLGBT) machine learning model. The monitoring system 102 is configured to analyze the metrics being monitored and determine which metrics are highly correlated to distribute amongst the agents 226. The monitoring system 102 may ultimately determine hundreds or thousands of correlations between the metrics being monitored. For instance, independent regressors can be used to determine that a first metric and a second metric satisfy a correlation threshold, and therefore are highly correlated. If a first metric and a second metric are highly correlated, the assigned agent can confidently predict a data value for the second metric from the first metric. In a more specific example, an analysis of the historic data values 126 may yield a correlation in which a data value for a “checkouts_completed” metric can be used to predict a data value for a “checkouts_cancelled” metric. In another specific example, a historic analysis of data values can be used to identify a correlation in which a data value for a “checkouts_completed_France” metric can be used to predict a data value for a “checkouts_completed_overall” metric.
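
By way of illustration only, the following sketch shows one way pairs of metrics might be screened against a correlation threshold before being distributed amongst agents. The use of the Pearson correlation coefficient, the threshold value of 0.9, the function names, and the sample data values are all assumptions; the disclosure states only that metrics satisfying a correlation threshold are treated as highly correlated.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)


def correlated_pairs(history, threshold=0.9):
    """Return (metric_a, metric_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(history), 2):
        r = pearson(history[a], history[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical historic data values for three metrics.
history = {
    "checkouts_completed_overall": [100, 120, 90, 150, 130],
    "checkouts_completed_France": [10, 12, 9, 15, 13],
    "signins_overall": [50, 20, 70, 30, 60],
}
pairs = correlated_pairs(history)
```

Here only the “checkouts_completed_France” and “checkouts_completed_overall” metrics pass the threshold, so only that pair would be assigned to an agent.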

Given two highly correlated metrics, an agent 226 generates a prediction model so that a data value of a first metric can predict a data value for a second metric. A prediction model can be generated for each set of highly correlated metrics by multiple different agents. Provided an actual data value for a particular metric, an agent is configured to compare the actual data value to the data values predicted by the agent's prediction model generated in association with the particular metric. The agent then uses its own prediction error and/or other exogenous factors, such as temporal factors, holiday factors, etc., to generate upper and lower quantile limits (i.e., bounds) on the error using QLGBT. Other mechanisms, such as a Quantile Loss Deep Neural Network (QL-DNN), can be used to generate the upper and lower bounds as well. If the error in a predicted data value falls outside a confidence interval when compared to the actual data value, then the agent provides a vote that signals an anomaly. In other words, a vote by an agent means that an error in predicting a data value for a metric was outside an acceptable error range. In various examples, the error is a normalized error (e.g., a percentage) that is agnostic to the scale of the data. In other examples, the error may be an absolute error.
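
By way of illustration only, the voting decision described above can be sketched as follows. The bounds are passed in as fixed numbers here, whereas the text describes them as being generated by QLGBT (or QL-DNN) from the prediction error and exogenous factors. The definition of the normalized error as a fraction of the predicted value, the function names, and the 10% prediction model are assumptions.

```python
def normalized_error(actual, predicted):
    """Signed prediction error as a fraction of the predicted value."""
    if predicted == 0:
        raise ValueError("normalized error is undefined for a zero prediction")
    return (actual - predicted) / predicted


def agent_votes(actual, predicted, lower_bound, upper_bound):
    """Return True when the error falls outside [lower_bound, upper_bound]."""
    error = normalized_error(actual, predicted)
    return error < lower_bound or error > upper_bound


def predict_cancelled(completed):
    """Hypothetical prediction model: cancellations track 10% of completions."""
    return 0.1 * completed


predicted = predict_cancelled(1000)

# An actual value close to the prediction stays inside the bounds: no vote.
normal_vote = agent_votes(actual=104, predicted=predicted,
                          lower_bound=-0.15, upper_bound=0.15)

# An actual value far below the prediction falls outside the bounds: a vote.
anomaly_vote = agent_votes(actual=60, predicted=predicted,
                           lower_bound=-0.15, upper_bound=0.15)
```

Because the error is expressed as a fraction rather than an absolute difference, the same bounds can be applied to metrics of very different scales.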

The vote is associated with a particular metric based on whether the error falls outside the upper bound or the lower bound. For example, a first correlated metric may be associated with the upper bound and a second correlated metric may be associated with the lower bound. The agent may use this approach when an anomaly is associated with a decrease in values (e.g., a dip). If the error in predicting the data value is outside the upper bound, the first correlated metric receives the vote and is therefore the voted metric. If the error in predicting the data value is outside the lower bound, the second correlated metric receives the vote and is therefore the voted metric. In another example, the first correlated metric may be associated with the lower bound and the second correlated metric may be associated with the upper bound. The agent may use this approach when an anomaly is associated with an increase in values (e.g., a spike).
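
By way of illustration only, the assignment of a vote to one metric of a correlated pair can be sketched as follows. The function name `voted_metric` and the string labels "dip" and "spike" are assumptions; the mapping between bounds and metrics follows the two approaches described above.

```python
def voted_metric(error, lower_bound, upper_bound,
                 first_metric, second_metric, anomaly_type="dip"):
    """Return the metric that receives the vote, or None if no vote is cast."""
    if lower_bound <= error <= upper_bound:
        return None  # error within bounds: the agent does not vote
    outside_upper = error > upper_bound
    if anomaly_type == "dip":
        # Dip approach: upper bound -> first metric, lower bound -> second metric.
        return first_metric if outside_upper else second_metric
    if anomaly_type == "spike":
        # Spike approach: the association between bounds and metrics is swapped.
        return second_metric if outside_upper else first_metric
    raise ValueError("anomaly_type must be 'dip' or 'spike'")


# Hypothetical pair of correlated metrics with illustrative bounds.
pair = ("checkouts_completed", "checkouts_cancelled")
vote = voted_metric(0.3, -0.15, 0.15, *pair)  # error above the upper bound
```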

Once the votes are received, the voting module 120 can analyze the agents assigned to the voted metric, and localize a problem to a specific attribute. For instance, the voting module 120 can examine the agents 226 to determine a total number of agents that share an attribute (e.g., a location such as the United States of America) with the voted metric, without regard to whether the agent voted or not. The voting module 120 also determines a number of voting agents that share the same attribute (e.g., a location such as the United States of America).

If a percentage determined based on the number of the voting agents that share the attribute and the total number of agents that share the same attribute is greater than a predetermined threshold percentage (e.g., a percentage between seventy and ninety percent), then the system can determine that the problem is likely localized to that attribute. In a specific example, a threshold can be set to seventy-five percent, and thus, if eighty out of a total of one hundred agents that use a metric related to the United States of America attribute provide a vote signaling an anomaly, then the system can localize a problem to a specific location, i.e., the United States of America.
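
By way of illustration only, the percentage comparison in the specific example above can be sketched as follows. Representing each agent as a dictionary with an `attributes` set and a `voted` flag is an assumption made for brevity, as are the function names. The strict greater-than comparison follows the wording of the preceding paragraph.

```python
def localization_percentage(agents, attribute):
    """Percentage of agents sharing `attribute` that cast a vote."""
    sharing = [a for a in agents if attribute in a["attributes"]]
    if not sharing:
        return 0.0
    voting = sum(1 for a in sharing if a["voted"])
    return 100.0 * voting / len(sharing)


def is_localized(agents, attribute, threshold_pct=75.0):
    """True when the voting percentage exceeds the threshold percentage."""
    return localization_percentage(agents, attribute) > threshold_pct


# Eighty of one hundred agents that share the USA attribute vote; twenty
# agents that share only the France attribute do not vote.
agents = (
    [{"attributes": {"USA"}, "voted": True} for _ in range(80)]
    + [{"attributes": {"USA"}, "voted": False} for _ in range(20)]
    + [{"attributes": {"France"}, "voted": False} for _ in range(20)]
)
usa_pct = localization_percentage(agents, "USA")
usa_localized = is_localized(agents, "USA")
```

In this sketch the percentage for the United States of America attribute is eighty percent, which exceeds the seventy-five percent threshold, while the France attribute remains at zero percent.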

Further, the voting module 120 can use this information in the tree map visual 204 to show a user that there may be a problem with respect to a particular attribute, such as the United States of America. For example, a size of a section can increase and/or a color of the section can change based on the percentage determined based on the number of the voting agents that share the attribute and the total number of agents that share the attribute and/or based on whether the percentage exceeds the predetermined threshold percentage.
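
By way of illustration only, one possible mapping from the voting percentage to the size and color of a tree map section is sketched below. The linear size scaling, the size limits, and the color cut-off at half of the threshold percentage are assumptions; the text specifies only that size and/or color can change with the percentage and the threshold.

```python
def section_style(percentage, threshold_pct=75.0, base_size=1.0, max_extra=3.0):
    """Return a (relative_size, color) pair for a tree map section."""
    pct = min(max(percentage, 0.0), 100.0)        # clamp to a valid percentage
    size = base_size + max_extra * pct / 100.0    # size grows with the percentage
    if pct > threshold_pct:
        color = "red"       # percentage exceeds the threshold
    elif pct >= threshold_pct / 2:
        color = "yellow"    # elevated, but below the threshold
    else:
        color = "green"     # little or no anomalous activity
    return size, color


quiet = section_style(0.0)
elevated = section_style(50.0)
alerting = section_style(100.0)
```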

In the example of FIG. 2A, the section representing attribute 224(3) has an increased size and/or a different color, and thus, attribute 224(3) is likely associated with anomalous activity. Accordingly, the user may select attribute 224(3) in order to better understand the problem, as referenced by 230.

Based on the user selection, both the radar-based visual and the tree map visual can be updated and/or rebuilt using data associated with the metrics that include attribute 224(3), as referenced by 232 and 234 in FIG. 2B. For example, if attribute 224(3) relates to the United States of America, then the radar-based visual is re-generated by the self-organizing radar (SOR) algorithm 206 using metrics that include the United States of America as an attribute. Moreover, the tree map visual is updated to display sections representing the metrics 236(1)-236(L) (where L in the context of FIG. 2B is a positive integer number) that include the United States of America as an attribute. In this way, the user can dig deeper to see where a problem is located (e.g., with regard to metric 236(2) and/or 236(3)).

In some examples, the object on the radar-based visual will not change locations when the radar-based visual is updated, but rather the background colored regions are redistributed. However, in other examples, the background colored regions can remain static and the object can change locations (e.g., the object is replotted).

FIG. 3 illustrates the use of the Quantile Loss Gradient Boosted Trees (QLGBT) model-based multi-agent system that can be used to generate a tree map visual. For each category of metric (e.g., checkouts), a set of highly correlated metrics as independent regressors is determined. For each correlated pair of metrics, a time-independent generalized linear model (GLM) with polynomial relationships is fitted, as referenced by 302. Models other than a generalized linear model can be used as well.

Using the regression error as a target along with temporal and exogenous variables, two gradient boosted tree (GBT) models are fitted around the prediction error, an example of which is referenced by 304. One gradient boosted tree (GBT) model can correspond to an upper bound and another gradient boosted tree (GBT) model can correspond to a lower bound.

Using the gradient of a quantile loss/objective function in the boosting process, intervals on the predicted error are obtained. The end result is a pipeline that provides thresholds on the error, which is calculated from the predicted data value for the metric compared to the actual data value for the metric, as referenced by 306. This process may be referred to as Dynamic Error Thresholding. This result is obtained for a large number (e.g., hundreds) of models utilizing correlated pairs of metrics.
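
By way of illustration only, the following sketch shows the standard quantile (pinball) loss and its gradient, which is assumed here to be the loss referred to above; the disclosure does not give the formula. The sketch does not implement gradient boosting. Instead, it shows that the constant minimizing the total loss over a set of errors is the corresponding quantile, which is the property that allows two models (one per quantile) to bracket the error. The function names, quantile levels, and sample errors are assumptions.

```python
def quantile_loss(y_true, y_pred, q):
    """Pinball loss: under-prediction costs q, over-prediction costs (1 - q)."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1.0) * diff


def quantile_loss_gradient(y_true, y_pred, q):
    """Derivative of the pinball loss with respect to y_pred.

    At y_true == y_pred the loss is not differentiable; this sketch
    arbitrarily returns the over-prediction slope there.
    """
    return -q if y_true > y_pred else 1.0 - q


def best_constant(values, q):
    """Pick the candidate in `values` that minimizes the total pinball loss."""
    return min(values, key=lambda c: sum(quantile_loss(v, c, q) for v in values))


# Hypothetical prediction errors spread evenly from -0.50 to 0.50.
errors = [i / 100 for i in range(-50, 51)]
upper = best_constant(errors, 0.95)  # upper bound on the error
lower = best_constant(errors, 0.05)  # lower bound on the error
```

In a boosted model, the gradient above would drive each boosting step, and the fitted bounds would vary with temporal and exogenous variables rather than being constants as in this sketch.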

A single agent can be the embodiment of a correlated pair of metrics, the attributes of the metrics, the polynomial regression model(s), and the corresponding QLGBT models (or QL-DNN models) which drive the agent's decision. The agent's decision objective is whether to vote and which metric from the pair of metrics to vote for as being anomalous.

Votes are then collected from the agents. When an agent votes, it means that the error in predicting a data value for a metric was outside a permissible error range. As referenced by 308, the problem can be localized by comparing a number of voting agents that share an attribute to a total number of agents that share the attribute. The comparison yields a percentage, and as the percentage increases, a likelihood that a problem is localized to the attribute also increases.

FIG. 4A is a diagram 400 that illustrates an example of how the multi-agent system can be used to identify an attribute that may be associated with anomalous activity. FIG. 4A illustrates six prediction models 402(1-6) within which at least one of the two correlated metrics includes a same attribute ‘X’ (e.g., a same attribute is shared by a metric in each of the prediction models 402(1-6)). Six prediction models are illustrated for ease of discussion. However, hundreds or thousands of prediction models may likely be associated with each attribute via the correlated metrics contained therein. Note that the prediction models 402(1-6) may include completely different metrics or overlapping metrics (e.g., the same metric may be used in more than one of the prediction models 402(1-6)).

As described above, prediction model 402(1) may determine that an error in predicting a metric value is outside a bound (e.g., the upper or the lower bound) associated with a metric that includes attribute ‘X’ 404(1). Similarly, prediction model 402(2) may determine that an error in predicting a metric value is outside a bound associated with a metric that includes attribute ‘X’ 404(2), prediction model 402(4) may determine that an error in predicting a metric value is outside a bound associated with a metric that includes attribute ‘X’ 404(4), and prediction model 402(6) may determine that an error in predicting a metric value is outside a bound associated with a metric that includes attribute ‘X’ 404(6). Accordingly, prediction models 402(1), 402(2), 402(4), 402(6), as well as the error detection mechanisms described herein, become voting agents and submit votes 406 for attribute ‘X’. The votes signal anomalous activity associated with attribute ‘X’.

However, prediction model 402(3) produces an error that is within a bound (e.g., the upper or the lower bound) associated with a metric that includes attribute ‘X’ 404(3). Similarly, prediction model 402(5) produces an error that is within a bound (e.g., the upper or the lower bound) associated with a metric that includes attribute ‘X’ 404(5). Accordingly, prediction models 402(3), 402(5) are not associated with agents that provide a vote. Consequently, four agents vote for attribute ‘X’ out of a total of six possible agents 408.

Continuing on to FIG. 4B, a percentage 410 is produced based on the number of voting agents and a total number of agents that share the attribute ‘X’ (e.g., 66.6%). The percentage is compared to a threshold percentage 412 to determine whether a problem is likely localized to attribute ‘X’. For example, if the threshold percentage is fifty percent, then the multi-agent voting system determines that a problem is likely localized to attribute ‘X’ 414 (e.g., because 66.6% is greater than or equal to 50%). If the threshold percentage is seventy-five percent, then the multi-agent voting system lacks confidence to localize a problem to attribute ‘X’ 416 (e.g., because 66.6% is less than 75%).

The multi-agent voting system can use the percentage determined for attribute ‘X’ and/or whether the percentage exceeds the threshold percentage as a basis for scaling a section in the tree map visual. For instance, a size of a section that represents attribute ‘X’ can increase and/or change colors as the percentage increases.

In various examples, user feedback can be used to update the QLGBT models used to drive the voting and/or the threshold percentage 412 used to determine whether a problem is localized to a particular attribute. For example, a user may provide feedback 418 indicating that there is no problem associated with attribute ‘X’ after the user performs further investigation based on the system's determination that a problem is likely localized to attribute ‘X’ 414 (e.g., a false positive). Accordingly, the system can increase the threshold percentage for the particular attribute (e.g., attribute ‘X’) or the attribute type based on the feedback 418. Additionally or alternatively, the system can adjust how the bounds are determined for prediction models 402(1-6) by a predetermined factor (e.g., 5%, 10%, etc.) so a size of the confidence interval for the error is reduced.

In another example, a user may provide feedback 420 indicating that there is a problem associated with attribute ‘X’ and the system failed to inform the user that the problem is likely localized to attribute ‘X’ 416 (e.g., a false negative). Accordingly, the system can decrease the threshold percentage for the particular attribute (e.g., attribute ‘X’) or the attribute type based on the feedback 420. Additionally or alternatively, the system can adjust how the bounds are determined for prediction models 402(1-6) by a predetermined factor (e.g., 5%, 10%, etc.) so a size of the confidence interval for the error is expanded.
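
By way of illustration only, the feedback-driven adjustment of the threshold percentage can be sketched as follows. The step size of five percentage points, the clamping limits, the feedback labels, and the function name are assumptions; the text states only that the threshold percentage is increased after a false positive and decreased after a false negative.

```python
def adjust_threshold(threshold_pct, feedback, step=5.0, low=50.0, high=95.0):
    """Return an updated threshold percentage based on user feedback."""
    if feedback == "false_positive":
        threshold_pct += step   # require more agreement before alerting
    elif feedback == "false_negative":
        threshold_pct -= step   # alert with less agreement among agents
    # Keep the threshold inside an (assumed) sensible operating range.
    return min(max(threshold_pct, low), high)


after_false_positive = adjust_threshold(75.0, "false_positive")
after_false_negative = adjust_threshold(75.0, "false_negative")
```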

This system of multiple agents collaborating in parallel is highly customizable, which enables continuous improvement and online learning. One example of such continuous improvement is that different voting agents can be assigned different weights, which can be updated over time based on user feedback. Another example of such continuous improvement is that, instead of using weights, a voting agent's upper and lower bounds can be multiplied by a relaxation factor to limit or increase its alerting frequency. Additionally, a user may choose to select one or more of the voting agents and remove them entirely from the system or replace them with new, separately trained, voting agents. This may be implemented, for example, to include newly added metrics to the system without resetting the training and deployment of the entire system.
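
By way of illustration only, the two customization options described above can be sketched as follows. The dictionary representation of an agent, the function names, and the assumption that the lower bound is negative and the upper bound is positive (so that a factor above one widens the interval) are not taken from the disclosure.

```python
def weighted_vote_percentage(agents):
    """Percentage of total agent weight held by the agents that voted."""
    total = sum(a["weight"] for a in agents)
    if total == 0:
        return 0.0
    voting = sum(a["weight"] for a in agents if a["voted"])
    return 100.0 * voting / total


def relax_bounds(lower_bound, upper_bound, factor):
    """Scale both bounds by a relaxation factor.

    Assuming lower_bound < 0 < upper_bound, a factor above 1 widens the
    interval (fewer votes) and a factor below 1 narrows it (more votes).
    """
    return lower_bound * factor, upper_bound * factor


# Hypothetical agents: the more trusted agent carries twice the weight.
agents = [
    {"weight": 2.0, "voted": True},
    {"weight": 1.0, "voted": False},
    {"weight": 1.0, "voted": True},
]
pct = weighted_vote_percentage(agents)
wider = relax_bounds(-0.1, 0.2, 2.0)
```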

FIGS. 5 and 6A-6B are diagrams illustrating aspects of methods 500 and 600 for implementing some of the techniques disclosed herein. It should be understood by those of ordinary skill in the art that the operations of the methods disclosed herein are not necessarily presented in any particular order and that performance of some or all of the operations in an alternative order(s) is possible and is contemplated. The operations have been presented in the demonstrated order for ease of description and illustration. Operations may be added, omitted, performed together, and/or performed simultaneously, without departing from the scope of the appended claims.

It should also be understood that the illustrated methods can end at any time and need not be performed in their entireties. Some or all operations of the methods, and/or substantially equivalent operations, can be performed by execution of computer-readable instructions included on a computer-readable storage medium, as defined herein. The term “computer-readable instructions,” and variants thereof, as used in the description and claims, is used expansively herein to include routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based, programmable consumer electronics, combinations thereof, and the like. Although the example methods described below are operating on a computing device, it can be appreciated that these methods can be performed on any computing system which may include a number of computing devices (e.g., servers) working in concert to perform the operations disclosed herein.

Thus, it should be appreciated that the logical operations described herein are implemented (1) as a sequence of computer implemented acts or program modules running on a computing system such as those described herein and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations may be implemented in software, in firmware, in special purpose digital logic, and any combination thereof.

The method 500 begins at operation 502, which illustrates receiving data associated with a plurality of metrics being monitored in association with a platform. For example, the metrics being monitored may relate to a particular function or process of the platform, such as the ability for customers to purchase items via an electronic commerce site. In a more specific example, the metrics being monitored may relate to one or more of a customer being able to sign in, a customer being able to check out, and so forth.

Operation 504 illustrates generating a first visual that represents the data. As described above, the first visual can include a radar-based visual that renders a heatmap and an object representing the plurality of metrics. The radar-based visual can provide a high-level indication of whether the platform is healthy based on the plurality of metrics. That is, movement of the object within the radar-based visual can signal normal activity or abnormal activity with respect to the plurality of metrics. Additionally or alternatively, a size of the object can signal normal activity or abnormal activity with respect to the plurality of metrics.

Operation 506 illustrates generating a second visual that represents the data. The second visual can include a tree map visual that includes a first plurality of sections and each section of the first plurality of sections is associated with an attribute used to compose one or more of the plurality of metrics. As described in examples above, an attribute can relate to a location, a type of device, a type of payment method, and so forth.

Operation 508 illustrates receiving user input that selects a section of the first plurality of sections. The user input may be provided in order to help a user attempt to localize a potential problem to a particular metric and/or a particular attribute.

Operation 510 illustrates identifying, based at least on the user input, a subset of the plurality of metrics related to the attribute associated with the section selected.

Operation 512 illustrates updating the first visual to re-render the heatmap and the object representing the subset of the plurality of metrics. This updated first visual can provide the user with a high-level indication of whether the platform is experiencing anomalous activity with respect to the subset of the plurality of metrics.

Operation 514 illustrates updating the second visual to include a second plurality of sections. An individual section in the second plurality of sections is associated with an individual metric in the subset of the plurality of metrics.
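By way of example, and not limitation, the drill-down of operations 510 and 514 might be sketched as follows. The `Metric` class, the `"key:value"` attribute strings, and the function names are hypothetical and are introduced only for illustration; rendering of the radar-based visual and the tree map visual is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric record: a name plus the attributes that compose it."""
    name: str
    attributes: frozenset  # e.g. {"country:US", "device:mobile"} (assumed encoding)


def select_subset(metrics, attribute):
    """Operation 510: keep the metrics related to the selected attribute."""
    return [m for m in metrics if attribute in m.attributes]


def tree_map_sections(subset):
    """Operation 514: one tree map section per individual metric in the subset."""
    return [m.name for m in subset]


# Illustrative data only; the metric names are assumptions.
metrics = [
    Metric("checkout_us_mobile", frozenset({"country:US", "device:mobile"})),
    Metric("checkout_us_desktop", frozenset({"country:US", "device:desktop"})),
    Metric("checkout_de_mobile", frozenset({"country:DE", "device:mobile"})),
]

# The user selects the tree map section associated with "country:US".
subset = select_subset(metrics, "country:US")
sections = tree_map_sections(subset)
```

In this sketch, the updated first visual of operation 512 would be re-rendered from `subset` rather than from the full list of metrics.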

Switching to FIGS. 6A-6B, the method 600 begins at operation 602, which illustrates evaluating a plurality of metrics to determine that a first metric is correlated to a second metric. Accordingly, this evaluation operation may yield a set of correlated metrics for a particular function or process the platform implements (e.g., checkouts).
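As one non-limiting illustration of operation 602, correlated metrics might be identified by computing a pairwise correlation coefficient over historical data values. The use of the Pearson coefficient, the 0.9 cutoff, and the metric names below are assumptions made for this sketch; the description above does not prescribe a particular correlation measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(series, min_abs_r=0.9):
    """Return metric pairs whose absolute correlation meets the cutoff."""
    names = sorted(series)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(series[first], series[second])) >= min_abs_r:
                pairs.append((first, second))
    return pairs


# Illustrative data only.
series = {
    "sign_ins": [100, 120, 90, 110, 130],
    "checkouts": [50, 61, 44, 56, 66],
    "page_errors": [3, 1, 4, 1, 5],
}
pairs = correlated_pairs(series)
```

Each pair returned by `correlated_pairs` is a candidate for which a prediction model can be generated in the operation that follows.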

Operation 604 illustrates generating a prediction model for the first metric and the second metric. Provided an actual data value for the first metric, the prediction model is configured to predict a data value for the second metric.

Operation 606 illustrates obtaining errors of the prediction model. In some examples, exogenous variables can be used to obtain the errors.
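A minimal sketch of operations 604 and 606 follows, assuming for illustration that the prediction model is an ordinary least-squares line relating the first metric to the second metric. The model form is an assumption of this sketch, and exogenous variables are not modeled here.

```python
def fit_line(first_values, second_values):
    """Fit second ~ slope * first + intercept by ordinary least squares."""
    n = len(first_values)
    if n != len(second_values) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(first_values) / n
    mean_y = sum(second_values) / n
    sxx = sum((x - mean_x) ** 2 for x in first_values)
    if sxx == 0:
        raise ValueError("first metric is constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(first_values, second_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, first_value):
    """Predict a data value for the second metric from the first metric."""
    slope, intercept = model
    return slope * first_value + intercept


def model_errors(model, first_values, second_values):
    """Errors of the model, taken here as actual minus predicted."""
    return [y - predict(model, x)
            for x, y in zip(first_values, second_values)]


# Illustrative historical data only.
sign_ins = [100, 120, 90, 110, 130]
checkouts = [50, 61, 44, 56, 66]
model = fit_line(sign_ins, checkouts)
errors = model_errors(model, sign_ins, checkouts)
```

The resulting list of errors is the input on which the upper bound and the lower bound are subsequently determined.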

Operation 608 illustrates determining an upper bound and a lower bound on the errors of the prediction model. In one example, the upper bound and the lower bound are determined using Quantile-Loss Gradient Boosted Tree error threshold models. The upper bound may be associated with the first metric and the lower bound may be associated with the second metric. Alternatively, the upper bound may be associated with the second metric and the lower bound may be associated with the first metric.
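The following sketch illustrates the general idea of operation 608 using a deliberately simplified stand-in: empirical quantiles of the historical errors rather than Quantile-Loss Gradient Boosted Tree models. A quantile-loss gradient boosted model would typically estimate bounds that vary with its inputs, whereas this sketch produces a single constant pair of bounds. The quantile levels of 0.05 and 0.95 are assumptions.

```python
def empirical_quantile(values, q):
    """Quantile of a sample, with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    fraction = position - low
    return ordered[low] + fraction * (ordered[high] - ordered[low])


def error_bounds(errors, lower_q=0.05, upper_q=0.95):
    """Return (lower bound, upper bound) on the prediction errors."""
    return empirical_quantile(errors, lower_q), empirical_quantile(errors, upper_q)


# Illustrative errors only.
historical_errors = [-2.0, -1.0, 0.0, 1.0, 2.0]
lower_bound, upper_bound = error_bounds(historical_errors, 0.25, 0.75)
```

A difference that later falls below `lower_bound` or above `upper_bound` would be treated as outside the bounds.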

Operation 610 illustrates using the prediction model to predict a data value for the second metric provided an actual data value for the first metric.

Operation 612 illustrates comparing an actual data value for the second metric to the predicted data value for the second metric to determine a difference.

Operation 614 illustrates determining that the difference is outside either the upper bound or the lower bound. When the difference is outside the upper bound, an agent associated with the prediction model provides a vote signaling an anomaly associated with the first metric or the second metric, whichever one is associated with the upper bound. That is, the agent is a voting agent and the metric for which an anomaly is signaled is a voted metric. When the difference is outside the lower bound, an agent associated with the prediction model provides a vote signaling an anomaly associated with the other one of the first metric or the second metric. That is, the agent is still the voting agent but the other metric for which the anomaly is signaled is the voted metric. In the example described above, an agent can provide three possible decision outputs: a vote for the first metric, a vote for the second metric, or no vote.
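The three possible decision outputs of an agent can be sketched as shown below. The function name, the parameter names, and the sign convention (the difference is taken as the actual data value minus the predicted data value) are assumptions made for this sketch.

```python
from typing import Optional


def agent_vote(actual: float, predicted: float,
               lower_bound: float, upper_bound: float,
               upper_metric: str, lower_metric: str) -> Optional[str]:
    """Return the voted metric, or None when the agent provides no vote.

    upper_metric is the metric associated with the upper bound and
    lower_metric is the metric associated with the lower bound.
    """
    difference = actual - predicted  # assumed sign convention
    if difference > upper_bound:
        return upper_metric
    if difference < lower_bound:
        return lower_metric
    return None  # difference is within the bounds


# Illustrative values only: bounds of -3 and +3 on the errors.
vote_high = agent_vote(70.0, 55.0, -3.0, 3.0, "sign_ins", "checkouts")
vote_low = agent_vote(40.0, 55.0, -3.0, 3.0, "sign_ins", "checkouts")
vote_none = agent_vote(56.0, 55.0, -3.0, 3.0, "sign_ins", "checkouts")
```

Which metric is associated with which bound is passed in explicitly because, as noted above, either assignment is possible.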

Operation 616 illustrates that the anomaly associated with the voted metric is displayed. For example, the vote signaling the anomaly can be used as a basis to generate the radar-based visual and/or the tree map visual.

The following operations in FIG. 6B can be used to determine whether a problem is likely localized to a particular attribute included in the voted metric and/or to alter the displayed visuals based on the determination.

Operation 620 illustrates determining a total number of agents that have a metric that shares a particular attribute with the voted metric.

Operation 622 illustrates identifying a number of voting agents, from the total number of agents, that provided a vote signaling anomalous activity.

Operation 624 illustrates generating a percentage for the particular attribute based on the number of voting agents and the total number of agents.

Operation 626 illustrates determining whether the percentage meets or exceeds a threshold percentage (e.g., a percentage between seventy and ninety percent).

Operation 628 illustrates determining whether a problem is localized to the particular attribute based on whether the percentage meets or exceeds the threshold percentage. For instance, if the generated percentage meets or exceeds the threshold percentage, a problem is likely associated with the particular attribute. If the generated percentage is less than the threshold percentage, a problem is not likely associated with the particular attribute (e.g., the voting agent may have identified an outlier instance).

Operations 620 through 628 may be repeated for each of the attributes included in a voted metric, so that a problem can be localized to an attribute that is likely causing the problem.
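Operations 620 through 628, repeated for each attribute of a voted metric, might be sketched as follows. The representation of an agent as a pair of an attribute set and a vote flag, the `"key:value"` attribute strings, and the threshold of eighty percent (a value within the example range given above) are assumptions of this sketch. The voting agent itself is counted among the agents sharing the attribute, which is also an assumption.

```python
def localize(agents, voted_attributes, threshold=0.8):
    """For each attribute of the voted metric, compute the share of agents
    with that attribute that voted, and whether it meets the threshold.

    agents: iterable of (attributes, voted) pairs, where attributes is a set
    of strings and voted is True if the agent signaled anomalous activity.
    Returns {attribute: (fraction, is_localized)}.
    """
    results = {}
    for attribute in voted_attributes:
        # Operation 620: agents whose metric shares the attribute.
        votes = [voted for attributes, voted in agents if attribute in attributes]
        total = len(votes)
        if total == 0:
            continue  # no agent shares this attribute
        # Operations 622 and 624: voting agents as a fraction of the total.
        fraction = sum(votes) / total
        # Operations 626 and 628: compare against the threshold.
        results[attribute] = (fraction, fraction >= threshold)
    return results


# Illustrative agents only.
agents = [
    ({"country:US", "device:mobile"}, True),
    ({"country:US", "device:desktop"}, True),
    ({"country:US", "device:tablet"}, True),
    ({"country:DE", "device:mobile"}, False),
    ({"country:FR", "device:mobile"}, False),
]
results = localize(agents, ["country:US", "device:mobile"])
```

In this illustrative data, every agent sharing the attribute `country:US` voted, while only one of three agents sharing `device:mobile` voted, so the sketch localizes the problem to the location attribute rather than to the device attribute.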

FIG. 7 shows an example computer architecture for a computer capable of providing the functionality described herein such as, for example, a computing device configured to implement the functionality described above with reference to FIGS. 1-6. Thus, the computer architecture 700 illustrated in FIG. 7 illustrates an architecture for a server computer or another type of computing device suitable for implementing the functionality described herein. The computer architecture 700 might be utilized to execute the various software components presented herein to implement the disclosed technologies.

The computer architecture 700 illustrated in FIG. 7 includes a central processing unit 702 (“CPU”), a system memory 704, including a random-access memory 706 (“RAM”) and a read-only memory (“ROM”) 708, and a system bus 710 that couples the memory 704 to the CPU 702. A firmware containing basic routines that help to transfer information between elements within the computer architecture 700, such as during startup, is stored in the ROM 708. The computer architecture 700 further includes a mass storage device 712 for storing an operating system 714, other data, and one or more executable programs including, for example, the SOR module 118 and/or the voting module 120.

The mass storage device 712 is connected to the CPU 702 through a mass storage controller (not shown) connected to the bus 710. The mass storage device 712 and its associated computer-readable media provide non-volatile storage for the computer architecture 700. Although the description of computer-readable media contained herein refers to a mass storage device, such as a solid-state drive, a hard disk or optical drive, it should be appreciated by those skilled in the art that computer-readable media can be any available computer storage media or communication media that can be accessed by the computer architecture 700.

Communication media includes computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics changed or set in a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.

By way of example, and not limitation, computer-readable storage media might include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, digital versatile disks (“DVD”), HD-DVD, BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer architecture 700. For purposes of the claims, the phrases “computer storage medium,” “computer-readable storage medium,” and variations thereof do not include waves, signals, and/or other transitory and/or intangible communication media, per se.

According to various implementations, the computer architecture 700 might operate in a networked environment using logical connections to remote computers through a network 750. A computing device implementing the computer architecture 700 might connect to the network 750 through a network interface unit 716 connected to the bus 710. It should be appreciated that the network interface unit 716 might also be utilized to connect to other types of networks and remote computer systems.

The computer architecture 700 might also include an input/output controller 718 for receiving and processing input from a number of other devices, including a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 718 might provide output to a display screen, a printer, or other type of output device.

It should be appreciated that the software components described herein might, when loaded into the CPU 702 and executed, transform the CPU 702 and the overall computer architecture 700 from a general-purpose computing system into a special-purpose computing system customized to facilitate the functionality presented herein. The CPU 702 might be constructed from any number of transistors or other discrete circuit elements, which might individually or collectively assume any number of states. More specifically, the CPU 702 might operate as a finite-state machine, in response to executable instructions contained within the software modules disclosed herein. These computer-executable instructions might transform the CPU 702 by specifying how the CPU 702 transitions between states, thereby transforming the transistors or other discrete hardware elements constituting the CPU 702.

Encoding the software modules presented herein might also transform the physical structure of the computer-readable media presented herein. The specific transformation of physical structure might depend on various factors, in different implementations of this description. Examples of such factors might include, but are not limited to, the technology used to implement the computer-readable media, whether the computer-readable media is characterized as primary or secondary storage, and the like. If the computer-readable media is implemented as semiconductor-based memory, the software disclosed herein might be encoded on the computer-readable media by transforming the physical state of the semiconductor memory. For example, the software might transform the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. The software might also transform the physical state of such components in order to store data thereupon.

As another example, the computer-readable media disclosed herein might be implemented using magnetic or optical technology. In such implementations, the software presented herein might transform the physical state of magnetic or optical media, when the software is encoded therein. These transformations might include altering the magnetic characteristics of locations within given magnetic media. These transformations might also include altering the physical features or characteristics of locations within given optical media, to change the optical characteristics of those locations. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this discussion.

In light of the above, it should be appreciated that many types of physical transformations take place in the computer architecture 700 in order to store and execute the software components presented herein. It also should be appreciated that the computer architecture 700 might include other types of computing devices, including hand-held computers, embedded computer systems, personal digital assistants, and other types of computing devices known to those skilled in the art.

It is also contemplated that the computer architecture 700 might not include all of the components shown in FIG. 7, might include other components that are not explicitly shown in FIG. 7, or might utilize an architecture completely different than that shown in FIG. 7. For example, and without limitation, the technologies disclosed herein can be utilized with multiple CPUs for improved performance through parallelization, graphics processing units (“GPUs”) for faster computation, and/or tensor processing units (“TPUs”). The term “processor” as used herein encompasses CPUs, GPUs, TPUs, and other types of processors.

The disclosure presented herein also encompasses the subject matter set forth in the following clauses.

Example Clause A, a method comprising: determining that a first metric is correlated to a second metric; generating, by one or more devices, a prediction model for the first metric that is correlated to the second metric; obtaining errors of the prediction model; determining an upper bound and a lower bound on the errors of the prediction model; using the prediction model to predict a data value for the second metric from an actual data value for the first metric; comparing an actual data value for the second metric to the predicted data value for the second metric to determine a difference; determining that the difference is outside either the upper bound or the lower bound resulting in a voting agent signaling an anomaly associated with a voted metric; and displaying the anomaly associated with the voted metric.

Example Clause B, the method of Example Clause A, further comprising: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is greater than or equal to a threshold percentage; and determining that a problem associated with the anomaly is localized to the attribute based on the percentage being greater than or equal to the threshold percentage.

Example Clause C, the method of Example Clause B, wherein the attribute comprises one of a specific location, a type of device, or a type of payment method.

Example Clause D, the method of Example Clause B, wherein the attribute is related to a list of items being sold by a user of an electronic commerce site.

Example Clause E, the method of Example Clause A, further comprising: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is less than a threshold percentage; and determining that a problem associated with the anomaly is not localized to the attribute based on the percentage being less than the threshold percentage.

Example Clause F, the method of any one of Example Clauses A through E, wherein: a first Quantile-Loss Gradient Boosted Tree (QLGBT) error thresholding model is used to determine the upper bound on the errors of the prediction model; and a second QLGBT error thresholding model is used to determine the lower bound on the errors of the prediction model.

Example Clause G, the method of Example Clause F, further comprising categorizing one of the first QLGBT error thresholding model or the second QLGBT error thresholding model as the voting agent which signals the anomaly associated with the voted metric.

Example Clause H, the method of any one of Example Clauses A through G, further comprising evaluating a plurality of metrics to determine that the first metric is correlated to the second metric.

Example Clause I, a system comprising: one or more processing units; and computer-readable storage media storing instructions that, when executed by the one or more processing units, cause the system to perform operations comprising: determining that a first metric is correlated to a second metric; generating a prediction model for the first metric that is correlated to the second metric; obtaining errors of the prediction model; determining an upper bound and a lower bound on the errors of the prediction model; using the prediction model to predict a data value for the second metric from an actual data value for the first metric; comparing an actual data value for the second metric to the predicted data value for the second metric to determine a difference; determining that the difference is outside either the upper bound or the lower bound resulting in a voting agent signaling an anomaly associated with a voted metric; and displaying the anomaly associated with the voted metric.

Example Clause J, the system of Example Clause I, wherein the operations further comprise: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is greater than or equal to a threshold percentage; and determining that a problem associated with the anomaly is localized to the attribute based on the percentage being greater than or equal to the threshold percentage.

Example Clause K, the system of Example Clause J, wherein the attribute comprises one of a specific location, a type of device, or a type of payment method.

Example Clause L, the system of Example Clause J, wherein the attribute is related to a list of items being sold by a user of an electronic commerce site.

Example Clause M, the system of Example Clause I, wherein the operations further comprise: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is less than a threshold percentage; and determining that a problem associated with the anomaly is not localized to the attribute based on the percentage being less than the threshold percentage.

Example Clause N, the system of any one of Example Clauses I through M, wherein: a first Quantile-Loss Gradient Boosted Tree (QLGBT) error thresholding model is used to determine the upper bound on the errors of the prediction model; and a second QLGBT error thresholding model is used to determine the lower bound on the errors of the prediction model.

Example Clause O, the system of Example Clause N, wherein the operations further comprise categorizing one of the first QLGBT error thresholding model or the second QLGBT error thresholding model as the voting agent which signals the anomaly associated with the voted metric.

Example Clause P, the system of any one of Example Clauses I through O, further comprising evaluating a plurality of metrics to determine that the first metric is correlated to the second metric.

Example Clause Q, computer-readable storage media comprising instructions that, when executed by one or more processing units, cause a system to perform operations comprising: determining that a first metric is correlated to a second metric; generating a prediction model for the first metric that is correlated to the second metric; obtaining errors of the prediction model; determining an upper bound and a lower bound on the errors of the prediction model; using the prediction model to predict a data value for the second metric from an actual data value for the first metric; comparing an actual data value for the second metric to the predicted data value for the second metric to determine a difference; determining that the difference is outside either the upper bound or the lower bound resulting in a voting agent signaling an anomaly associated with a voted metric; and displaying the anomaly associated with the voted metric.

Example Clause R, the computer-readable storage media of Example Clause Q, wherein the operations further comprise: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is greater than or equal to a threshold percentage; and determining that a problem associated with the anomaly is localized to the attribute based on the percentage being greater than or equal to the threshold percentage.

Example Clause S, the computer-readable storage media of Example Clause R, wherein the attribute comprises one of a specific location, a type of device, or a type of payment method.

Example Clause T, the computer-readable storage media of any one of Example Clauses Q through S, wherein: a first Quantile-Loss Gradient Boosted Tree (QLGBT) error thresholding model is used to determine the upper bound on the errors of the prediction model; a second QLGBT error thresholding model is used to determine the lower bound on the errors of the prediction model; and the operations further comprise categorizing one of the first QLGBT error thresholding model or the second QLGBT error thresholding model as the voting agent which signals the anomaly associated with the voted metric.

The terms “a,” “an,” “the” and similar referents used in the context of describing the techniques (especially in the context of the following claims) are to be construed to cover both the singular and the plural unless otherwise indicated herein or clearly contradicted by context. The terms “based on,” “based upon,” and similar referents are to be construed as meaning “based at least in part” which includes being “based in part” and “based in whole” unless otherwise indicated or clearly contradicted by context.

It should be appreciated that any reference to “first,” “second,” etc. elements within the Summary and/or Detailed Description is not intended to and should not be construed to necessarily correspond to any reference of “first,” “second,” etc. elements of the claims. Rather, any use of “first” and “second” within the Summary, Detailed Description, and/or claims may be used to distinguish between two different instances of the same element (e.g., two different metrics, two different visuals, etc.).

It is to be appreciated that conditional language used herein such as, among others, “can,” “could,” “might” or “may,” unless specifically stated otherwise, is understood within the context to present that certain examples include, while other examples do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that certain features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether certain features, elements and/or steps are included or are to be performed in any particular example. Conjunctive language such as the phrase “at least one of X, Y or Z,” unless specifically stated otherwise, is to be understood to present that an element may be either X, Y, or Z, or a combination thereof.

It should also be appreciated that many variations and modifications may be made to the above-described examples, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

In closing, although the various configurations have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended representations is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter. 

What is claimed is:
 1. A method comprising: determining that a first metric is correlated to a second metric; generating, by one or more devices, a prediction model for the first metric that is correlated to the second metric; obtaining errors of the prediction model; determining an upper bound and a lower bound on the errors of the prediction model; using the prediction model to predict a data value for the second metric from an actual data value for the first metric; comparing an actual data value for the second metric to the predicted data value for the second metric to determine a difference; determining that the difference is outside either the upper bound or the lower bound resulting in a voting agent signaling an anomaly associated with a voted metric; and displaying the anomaly associated with the voted metric.
 2. The method of claim 1, further comprising: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is greater than or equal to a threshold percentage; and determining that a problem associated with the anomaly is localized to the attribute based on the percentage being greater than or equal to the threshold percentage.
 3. The method of claim 2, wherein the attribute comprises one of a specific location, a type of device, or a type of payment method.
 4. The method of claim 2, wherein the attribute is related to a list of items being sold by a user of an electronic commerce site.
 5. The method of claim 1, further comprising: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is less than a threshold percentage; and determining that a problem associated with the anomaly is not localized to the attribute based on the percentage being less than the threshold percentage.
 6. The method of claim 1, wherein: a first Quantile-Loss Gradient Boosted Tree (QLGBT) error thresholding model is used to determine the upper bound on the errors of the prediction model; and a second QLGBT error thresholding model is used to determine the lower bound on the errors of the prediction model.
 7. The method of claim 6, further comprising categorizing one of the first QLGBT error thresholding model or the second QLGBT error thresholding model as the voting agent which signals the anomaly associated with the voted metric.
 8. The method of claim 1, further comprising evaluating a plurality of metrics to determine that the first metric is correlated to the second metric.
 9. A system comprising: one or more processing units; and computer-readable storage media storing instructions that, when executed by the one or more processing units, cause the system to perform operations comprising: determining that a first metric is correlated to a second metric; generating a prediction model for the first metric that is correlated to the second metric; obtaining errors of the prediction model; determining an upper bound and a lower bound on the errors of the prediction model; using the prediction model to predict a data value for the second metric from an actual data value for the first metric; comparing an actual data value for the second metric to the predicted data value for the second metric to determine a difference; determining that the difference is outside either the upper bound or the lower bound resulting in a voting agent signaling an anomaly associated with a voted metric; and displaying the anomaly associated with the voted metric.
 10. The system of claim 9, wherein the operations further comprise: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is greater than or equal to a threshold percentage; and determining that a problem associated with the anomaly is localized to the attribute based on the percentage being greater than or equal to the threshold percentage.
 11. The system of claim 10, wherein the attribute comprises one of a specific location, a type of device, or a type of payment method.
 12. The system of claim 10, wherein the attribute is related to a list of items being sold by a user of an electronic commerce site.
 13. The system of claim 9, wherein the operations further comprise: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is less than a threshold percentage; and determining that a problem associated with the anomaly is not localized to the attribute based on the percentage being less than the threshold percentage.
 14. The system of claim 9, wherein: a first Quantile-Loss Gradient Boosted Tree (QLGBT) error thresholding model is used to determine the upper bound on the errors of the prediction model; and a second QLGBT error thresholding model is used to determine the lower bound on the errors of the prediction model.
 15. The system of claim 14, wherein the operations further comprise categorizing one of the first QLGBT error thresholding model or the second QLGBT error thresholding model as the voting agent which signals the anomaly associated with the voted metric.
 16. The system of claim 9, further comprising evaluating a plurality of metrics to determine that the first metric is correlated to the second metric.
 17. Computer-readable storage media comprising instructions that, when executed by one or more processing units, cause a system to perform operations comprising: determining that a first metric is correlated to a second metric; generating a prediction model for the first metric that is correlated to the second metric; obtaining errors of the prediction model; determining an upper bound and a lower bound on the errors of the prediction model; using the prediction model to predict a data value for the second metric from an actual data value for the first metric; comparing an actual data value for the second metric to the predicted data value for the second metric to determine a difference; determining that the difference is outside either the upper bound or the lower bound resulting in a voting agent signaling an anomaly associated with a voted metric; and displaying the anomaly associated with the voted metric.
 18. The computer-readable storage media of claim 17, wherein the operations further comprise: determining a total number of agents that share an attribute with the voted metric; determining a number of voting agents from the total number of agents; generating a percentage for the attribute based on the number of voting agents and the total number of agents; determining that the percentage is greater than or equal to a threshold percentage; and determining that a problem associated with the anomaly is localized to the attribute based on the percentage being greater than or equal to the threshold percentage.
 19. The computer-readable storage media of claim 18, wherein the attribute comprises one of a specific location, a type of device, or a type of payment method.
 20. The computer-readable storage media of claim 17, wherein: a first Quantile-Loss Gradient Boosted Tree (QLGBT) error thresholding model is used to determine the upper bound on the errors of the prediction model; a second QLGBT error thresholding model is used to determine the lower bound on the errors of the prediction model; and the operations further comprise categorizing one of the first QLGBT error thresholding model or the second QLGBT error thresholding model as the voting agent which signals the anomaly associated with the voted metric. 